DATA SCIENCE


“The goal is to turn data into
information, and information into
insight.”


Course Outcomes: DATA SCIENCE 


*Upon successful completion of the course, the 
students will be able to 


*CO1 Apply statistical methods to data for
inferences.


*CO2 Analyze data using classification, graphical
and computational methods.


*CO3 Illustrate graphical analysis and hypothesis
testing methods.


*CO4 Describe data wrangling approaches.


*CO5 Perform descriptive analytics over massive
data.
DATA SCIENCE-Syllabus


* UNIT - I: Introduction and Linear Regression: Statistical learning,
Assessing model accuracy, descriptive statistics, Linear Regression: Simple
and multiple linear regressions, k-nearest neighbor regression.

* UNIT - II: Machine Learning: Modeling, Overfitting and Underfitting,
Correctness, The Bias-Variance Tradeoff, Feature Extraction and Selection,
k-Nearest Neighbors, Naive Bayes, Gradient Descent.

* UNIT - III: Graphical Analysis & Hypothesis Testing: Visualizing Data:
matplotlib, Bar Charts, Histograms and frequency polygons, box-plots,
quartiles, scatter plots, heat maps.
Simple hypothesis testing, Student's t-test, paired t and U tests, correlation
and covariance, tests for association.

* UNIT - IV: Data Wrangling: Data acquisition, the split-apply-combine
paradigm, data formats, imputation, Cleaning and Munging, Rescaling,
Dimensionality Reduction.

* UNIT - V: Computational Methods and Analytical Processing:
Programming for eigenvalues and eigenvectors, sparse matrices, QR and
SVD, Data warehousing and OLAP, data summarization, data
de-duplication, data visualization using CUBEs.




* UNIT - I: Introduction and Linear Regression:
1.) Statistical Learning:
    What Is Statistical Learning?
    Why Estimate f?
    How Do We Estimate f?
    The Trade-Off Between Prediction Accuracy and Model Interpretability
    Supervised Versus Unsupervised Learning
    Regression Versus Classification Problems
2.) Assessing Model Accuracy:
    Measuring the Quality of Fit
3.) Descriptive Statistics: Mean, Median, Mode, Variance, Standard Deviation.
4.) Linear Regression: Simple and Multiple
5.) K-Nearest Neighbour Regression.
Data Science


*Data science is a field that involves extracting insights
and knowledge from data using various techniques and
tools.


*It is a multidisciplinary field that combines statistics,
mathematics, computer science, and domain expertise.


*The goal of data science is to use data to make better
decisions and predictions. It has become an essential part
of many industries, including healthcare, finance,
marketing, and more.


*You should focus on the following topics to learn data
science:


1.) Programming
2.) Statistics
3.) Data Visualization
4.) Machine Learning
5.) Hands-on Projects (continuous working...)


Data Science = ?


[Figure: data science shown as the combination of domain expertise, statistical research, and data processing.]
Statistical learning


*Statistical learning refers to a vast set of tools for
understanding data.

*These tools can be classified as supervised or
unsupervised.

*Supervised statistical learning involves
classification and regression (prediction).

*Unsupervised statistical learning involves
clustering.

*The inputs go by different names, such as predictors,
independent variables, or features, and are typically
denoted by X.

*The output variable is called the response or
dependent variable, and is typically denoted using Y.
Statistical learning:


*We can model the relationship as

$$Y = f(X) + \epsilon$$

*where f is an unknown function that represents the systematic
information that X provides about Y, and

*$\epsilon$ is a random error term with mean zero.

*Here f is some fixed but unknown function of $X_1, \dots, X_p$.

Ex: Suppose that we are statistical consultants hired
by a client to provide advice on how to improve sales of
a particular product through three different media: TV,
radio, and newspaper.
In this setting, the advertising budgets are input
variables while sales is an output variable.


Why Estimate f?


*Statistical learning refers to using the data to "learn" f.
*There are two reasons for estimating f:
1) Prediction and 2) Inference.


*If we can produce a good estimate for f (and the variance of $\epsilon$ is
not too large), we can make accurate predictions for the
response, Y, based on a new value of X:

$$\hat{Y} = \hat{f}(X) \qquad (2.2)$$

where $\hat{f}$ represents our estimate for f and $\hat{Y}$
represents the resulting prediction for Y.

*Alternatively, we may also be interested in the type of
relationship between Y and the X's. For example, we may want to infer:
*Which particular predictors actually affect the response?
*Is the relationship positive or negative?
*Is the relationship a simple linear one, or is it more
complicated?


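1. Prediction

$$\hat{Y} = \hat{f}(X) \qquad (2.2)$$

where $\hat{f}$ represents our estimate for f and $\hat{Y}$
represents the resulting prediction for Y.
The accuracy of $\hat{Y}$ depends on two quantities:
1.) Reducible error
2.) Irreducible error
Case (i): If $\hat{f}$ is not equal to f, then it is not a
perfect estimate, and the gap contributes reducible error.
Case (ii): If $\hat{f}$ is equal to f, then it is a perfect
estimate, but the prediction still carries irreducible error,
because Y also depends on $\epsilon$.

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Irreducible}}$$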
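The split into reducible and irreducible error can be checked numerically. The sketch below is only an illustration: the functions `f` and `f_hat`, the noise level `sigma`, and the point `x0` are all assumed values, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f.
    return 2.5 + 2.0 * x

sigma = 1.5        # assumed noise standard deviation, so Var(eps) = 2.25
x0 = 1.0           # fixed value of X at which we predict
n_draws = 200_000

eps = rng.normal(0.0, sigma, size=n_draws)
y = f(x0) + eps            # Y = f(X) + eps
y_hat = f_hat(x0)          # the prediction is the same for every draw

# Monte Carlo estimate of E(Y - Y_hat)^2
expected_sq_error = float(np.mean((y - y_hat) ** 2))
reducible = (f(x0) - f_hat(x0)) ** 2    # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                # Var(eps)

print(f"simulated E(Y - Y_hat)^2: {expected_sq_error:.3f}")
print(f"reducible + irreducible : {reducible + irreducible:.3f}")
```

Improving `f_hat` can shrink only the reducible term; the `sigma ** 2` term stays no matter how good the estimate is.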




2. Inference 


*Alternatively, we may also be interested in the
type of relationship between Y and the X's. For
example:

*Which particular predictors actually affect the response?

*Is the relationship positive or negative?

*Is the relationship a simple linear one, or is it more
complicated?

*For instance, consider the Advertising data described
above. One may be interested in answering
questions such as:

*Which media contribute to sales?

*Which media generate the biggest boost in sales?
How Do We Estimate f?


*We will assume we have observed a set of n
training data points:

$$\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$$

*We must then use the training data and a
statistical method to estimate f.

*Statistical learning methods:
    *Parametric methods
    *Non-parametric methods
Parametric Methods


*Parametric methods reduce the problem of estimating f down to one of
estimating a set of parameters. They involve a two-step, model-based
approach.

STEP 1: Make some assumption about the functional form
of f, i.e. come up with a model. The most common example
is a linear model, i.e.

$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

However, we will also examine more complicated and flexible
models for f. In a sense, the more flexible the model, the
more realistic it is.

STEP 2: Use the training data to fit the model, i.e. estimate f or,
equivalently, the unknown parameters $\beta_0, \beta_1, \beta_2, \dots, \beta_p$.

The most common approach for estimating the parameters
in a linear model is ordinary least squares (OLS).
Non-parametric Methods


*They do not make explicit assumptions about the
functional form of f.

*For example, a thin-plate spline can be used to estimate f. This
approach does not impose any pre-specified model
on f. It instead attempts to produce an estimate for
f that is as close as possible to the observed data,
subject to the fit being smooth (the yellow surface in
Figure 2.5).

*Advantages: They accurately fit a wider range of
possible shapes of f.

*Disadvantages: A very large number of
observations is required to obtain an accurate
estimate of f.


Thin-Plate Spline Estimate


*Non-linear regression methods are more flexible and can
potentially provide more accurate estimates.

[Figure: a smooth thin-plate spline fit to the Income data, shown in yellow; the observations are displayed in red.]
The Trade-Off Between Prediction
Accuracy and Model Interpretability


*Of the many methods that we examine, some are less
flexible, or more restrictive, in the sense that they
can produce just a relatively small range of
shapes to estimate f.

*For example, linear regression is a relatively
inflexible approach, because it can only generate
linear functions.

*Other methods, such as thin-plate splines, are
considerably more flexible because they can
generate a much wider range of possible shapes
to estimate f.

*There are several reasons that we might prefer a
more restrictive model. If we are mainly
interested in inference, then restrictive models
are much more interpretable.
The Trade-Off Between Prediction Accuracy
and Model Interpretability


[Figure: A representation of the tradeoff between flexibility and
interpretability, using different statistical learning methods. In
general, as the flexibility of a method increases, its interpretability
decreases. Horizontal axis: Flexibility (Low to High). Vertical axis:
Interpretability (Low to High). Methods shown: Subset Selection,
Least Squares, Generalized Additive Models, Trees, Bagging, Boosting,
Support Vector Machines.]
*Lasso Regression: Lasso is a regression analysis
method that performs both variable selection and
regularization in order to enhance the prediction
accuracy and interpretability of the resulting
statistical model.

*Least Squares: The least squares method is the process
of finding the best-fitting curve or line for
a set of data points by minimizing the sum of the
squares of the offsets (residuals) of the points
from the curve.

*GAM: Generalized Additive Models (GAMs) are a flexible
extension of linear models that allow for non-linear relationships.

*Bagging (Bootstrap Aggregating) is a technique
that involves training multiple models on different
bootstrap samples of the training data and combining
their predictions.
Types of Statistical Learning:


*Supervised learning problems are divided into
1) regression and 2) classification problems.


1. Regression covers situations where Y is
continuous/numerical, e.g.
    *Predicting the value of the Dow stock index in 6 months.
    *Predicting the value of a given house based on various
    inputs.
2. Classification covers situations where Y is
categorical, e.g.
    *Will the Dow be up (U) or down (D) in 6 months?
    *Is this email spam or not?


With unsupervised statistical learning, there are inputs but no
supervising output; nevertheless, we can learn relationships
and structure from such data.
Assessing Model Accuracy


*We use different statistical approaches/methods to
build a solution for a data analytics problem.

*One statistical method may work well with a specific
dataset and some other method may work better on a
different dataset.

*So it is important to decide, for a particular dataset,
which method produces the best results.

*In order to evaluate the performance of a statistical
learning method, we need to measure how well its
predictions actually match the observed data.

*Today we will look into measures which help
us in assessing the model accuracy:

*Mean Squared Error
Measuring the Quality of Fit


Mean Squared Error (MSE)


In regression analysis, the most commonly used measure
is the mean squared error, given by the equation:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}(x_i)\right)^2$$

*where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the ith observation
in our training data.


*The MSE will be small if the predicted responses are very
close to the true responses, and will be large if for some of
the observations, the predicted and true responses differ
substantially.

*The MSE calculated above is called the training MSE. What
matters more, however, is the test MSE, i.e. the MSE of the
model on previously unseen test data.
Measuring the Quality of Fit


To state it more mathematically, suppose that we
fit our statistical learning method on our training
observations $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, and we
obtain the estimate $\hat{f}$. We can then compute $\hat{f}(x_1),
\hat{f}(x_2), \dots, \hat{f}(x_n)$. If these are approximately equal to
$y_1, y_2, \dots, y_n$, then the training MSE is small.
However, we are really not interested in whether
$\hat{f}(x_i) \approx y_i$; instead, we want to know whether $\hat{f}(x_0)$ is
approximately equal to $y_0$, where $(x_0, y_0)$ is a
previously unseen test observation not used to train
the statistical learning method. We want the method that gives
the lowest test MSE:

$$\mathrm{Ave}\left(y_0 - \hat{f}(x_0)\right)^2$$
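The difference between training MSE and test MSE can be seen in a small simulation. This is a minimal sketch under stated assumptions: the "true" function, the noise level, and the use of polynomial fits of increasing degree as methods of increasing flexibility are all illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Assumed "true" f, used only to generate illustrative data.
    return np.sin(2.0 * x)

def mse(y, y_pred):
    # Mean squared error: the average of the squared differences.
    return float(np.mean((y - y_pred) ** 2))

# Training observations and separate, previously unseen test observations.
x_train = rng.uniform(0.0, 3.0, size=30)
y_train = true_f(x_train) + rng.normal(0.0, 0.3, size=30)
x_test = rng.uniform(0.0, 3.0, size=1000)
y_test = true_f(x_test) + rng.normal(0.0, 0.3, size=1000)

results = {}
for degree in (1, 3, 9):
    # Least squares polynomial fit; a higher degree means a more flexible model.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = mse(y_train, np.polyval(coeffs, x_train))
    test_mse = mse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: training MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Because the polynomial models are nested, the training MSE cannot increase as the degree grows. The test MSE has no such guarantee, which is why it is the quantity used to compare methods.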
Descriptive statistics


The field of statistics can be broken down into two
broad categories: 1) descriptive statistics and
2) inferential statistics.


1. Descriptive statistics allow a researcher to
describe or summarize their data. For example,
descriptive statistics for a study using human
subjects might include the sample size, mean age
of participants, percentage of males and females,
range of scores on a study measure, etc.
Descriptive statistics are often briefly presented at
the beginning of the Results section.
Descriptive Statistics


1.) Mean: Sum of observations / total number of
observations


2.) Mode: The most frequent element in the given dataset


3.) Median: The middle value after arranging the
data in either ascending or descending
order


4.) Variance: The average of the squared differences
between each observation in the given data
and the mean


5.) Standard Deviation: The square root of
the variance
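The five measures above can be computed with Python's standard `statistics` module. The data list is a hypothetical example. The sketch uses the population forms `pvariance` and `pstdev`, which divide by n; the sample forms `variance` and `stdev` divide by n - 1 instead.

```python
import statistics

# Hypothetical dataset, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # sum of observations / number of observations
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequent element
variance = statistics.pvariance(data)   # average squared difference from the mean
std_dev = statistics.pstdev(data)       # square root of the variance

print(mean, median, mode, variance, std_dev)  # 5 4.5 4 4 2.0
```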
Linear Regression


*In statistics, linear regression is a linear approach to
modeling the relationship between a scalar response (or
dependent variable) and one or more explanatory
variables (or independent variables).


*The case of one explanatory variable is called simple
linear regression. For more than one explanatory
variable, the process is called multiple linear regression.


*Simply put, linear regression finds the relationship between
one or more predictor variables and one outcome
variable.

*For example, it can be used to quantify the relative
impacts of age, gender, and diet (the predictor
variables) on height (the outcome variable).
Simple Linear Regression


*The very simplest case of a single scalar predictor variable x
and a single scalar response variable y is known as simple
linear regression.

*In simple linear regression a single independent
variable is used to predict the value of a dependent
variable. In multiple linear regression two or
more independent variables are used to predict the
value of a dependent variable. The difference
between the two is the number of independent
variables; in both cases there is only a single
dependent variable.

$$Y_i = \underbrace{\beta_0 + \beta_1 X_i}_{\text{Linear component}} + \underbrace{\epsilon_i}_{\text{Random error component}}$$

where $Y_i$ is the dependent variable, $\beta_0$ is the population Y
intercept, $\beta_1$ is the population slope coefficient, $X_i$ is the
independent variable, and $\epsilon_i$ is the random error term.
Estimating the Coefficients


*In practice, $\beta_0$ and $\beta_1$ are unknown. So
before we can use the model to make predictions, we
must use data to estimate the coefficients.
Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ represent n
observation pairs.
*Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for Y
based on the ith value of X. Then $e_i = y_i - \hat{y}_i$
represents the ith residual: the
difference between the ith observed
response value and the ith response value
that is predicted by our linear model. We
define the residual sum of squares (RSS) as

$$RSS = e_1^2 + e_2^2 + \dots + e_n^2$$
Written in terms of the coefficient estimates, the residual sum of squares is

$$RSS = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$$
Linear Regression: worked example


Student    xi    (xi-xa)    Sq(xi-xa)
1          95    17         289
2          85    7          49
3          80    2          4
4          70    -8         64
5          60    -18        324

Mean       78               Sum: 730

Here xa is the mean of the xi values and ya is the mean of the yi values.
The yi and (yi-ya) columns of the table did not survive extraction; the
recoverable summary values are ya = 77 and an estimated slope of 0.6438356.
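The slope in the worked example follows from the least squares formulas slope = sum((xi - xa)(yi - ya)) / sum((xi - xa)^2) and intercept = ya - slope * xa. The sketch below uses the x values from the table. The y values are an assumption: they are hypothetical scores chosen so that they reproduce the surviving summary values (mean 77 and slope 0.6438356), not data recovered from the slide.

```python
# x values as in the table above.
x = [95, 85, 80, 70, 60]
# ASSUMED y values: hypothetical, picked to match the stated mean (77) and slope.
y = [85, 95, 70, 65, 70]

n = len(x)
x_mean = sum(x) / n    # xa
y_mean = sum(y) / n    # ya

# Sum of cross-deviations and sum of squared deviations of x.
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)

slope = sxy / sxx                      # estimate of beta_1
intercept = y_mean - slope * x_mean    # estimate of beta_0

print(f"xa={x_mean}, ya={y_mean}, Sxx={sxx}, Sxy={sxy}")
print(f"slope={slope:.7f}, intercept={intercept:.4f}")
```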
Assessing the Accuracy of the Model


*Once we have rejected the null hypothesis in favor
of the alternative hypothesis, it is natural to want
to quantify the extent to which the model fits the
data.


*The quality of a linear regression fit is typically
assessed using two related quantities: the residual
standard error (RSE) and the $R^2$ statistic.


Quantity                   Value
Residual standard error    3.26
R^2                        0.612
F-statistic                (value not shown)


$$RSE = \sqrt{\frac{1}{n-2}\, RSS}$$
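Both quantities can be computed directly from a fitted line. The sketch below assumes the standard definition $R^2 = 1 - RSS/TSS$, where TSS is the total sum of squares around the mean of y (the slide names the statistic but does not show its formula). The function name `rse_and_r2` and the data are hypothetical.

```python
import numpy as np

def rse_and_r2(x, y):
    """Fit a simple linear regression and return (RSE, R^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.polyfit with deg=1 returns [slope, intercept] of the least squares line.
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    rss = float(np.sum(residuals ** 2))          # residual sum of squares
    tss = float(np.sum((y - y.mean()) ** 2))     # total sum of squares
    n = len(x)
    rse = float(np.sqrt(rss / (n - 2)))          # RSE = sqrt(RSS / (n - 2))
    r2 = 1.0 - rss / tss                         # assumed definition of R^2
    return rse, r2

# Hypothetical data, for illustration only.
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]
rse, r2 = rse_and_r2(x, y)
print(f"RSE={rse:.4f}, R^2={r2:.4f}")
```

The RSE is in the units of y, while $R^2$ is a proportion between 0 and 1, so the two give complementary views of the same fit.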
Multiple Linear Regression


*The extension of simple linear regression to multiple and/or
vector-valued predictor variables (denoted with a capital X) is
known as multiple linear regression, also known as
multivariable linear regression.

*Multiple linear regression is a generalization of simple linear
regression to the case of more than one independent variable,
and a special case of general linear models, restricted to one
dependent variable. The basic model is

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \epsilon_i$$

for each observation i = 1, ..., n.

*In the formula above we consider n observations of one
dependent variable and p independent variables. Thus, $Y_i$ is
the ith observation of the dependent variable, and $X_{ij}$ is the ith
observation of the jth independent variable, j = 1, 2, ..., p. The
values $\beta_j$ are the parameters to be estimated, and $\epsilon_i$ is
the ith error term.
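The multiple regression model can be fitted by least squares on a design matrix whose first column is all ones (for the intercept). The sketch below is an illustration on simulated data: the sample size, the coefficient values in `true_beta`, and the noise level are assumptions, and `np.linalg.lstsq` is one of several ways to solve the least squares problem.

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 200, 3                                   # n observations, p independent variables
X = rng.normal(size=(n, p))                     # X[i, j]: ith observation of variable j
true_beta = np.array([1.0, 2.0, -0.5, 0.25])    # assumed beta_0, beta_1, ..., beta_p
eps = rng.normal(0.0, 0.1, size=n)              # random error terms

# Y_i = beta_0 + beta_1 X_i1 + ... + beta_p X_ip + eps_i
y = true_beta[0] + X @ true_beta[1:] + eps

# Prepend a column of ones so that the first coefficient is the intercept.
design = np.column_stack([np.ones(n), X])
beta_hat, _, rank, _ = np.linalg.lstsq(design, y, rcond=None)

print("estimated coefficients:", np.round(beta_hat, 3))
```

With low noise and 200 observations, the estimates land close to the assumed coefficients; with fewer observations or more noise they would scatter more widely.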
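Estimating the Coefficients


*For comparison, the variance of the sample mean $\hat{\mu}$ is

$$\mathrm{Var}(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$$

*where $\sigma$ is the standard deviation of each observation. Roughly
speaking, the standard error tells us the average
amount that the estimate $\hat{\mu}$ differs from the
actual value of $\mu$.

*In the same way, we can state the standard
errors in estimating the coefficients (intercept and
slope) in the regression equation:

$$SE(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\sigma^2 = \mathrm{Var}(\epsilon)$.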
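The standard errors of the intercept and slope can be computed once $\sigma^2$ is replaced by an estimate. The sketch below assumes the usual choice of estimating $\sigma^2$ by RSS / (n - 2), i.e. the square of the residual standard error; the function name and the data are hypothetical.

```python
import math

def coefficient_standard_errors(x, y):
    """Return (SE of intercept, SE of slope) for a simple linear regression."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx                  # least squares slope
    b0 = y_mean - b1 * x_mean       # least squares intercept
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)      # ASSUMPTION: sigma^2 estimated by RSS / (n - 2)
    se_b0 = math.sqrt(sigma2_hat * (1.0 / n + x_mean ** 2 / sxx))
    se_b1 = math.sqrt(sigma2_hat / sxx)
    return se_b0, se_b1

# Hypothetical data, for illustration only.
x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]
se_b0, se_b1 = coefficient_standard_errors(x, y)
print(f"SE(intercept)={se_b0:.3f}, SE(slope)={se_b1:.4f}")
```

Note how the slope's standard error shrinks when the x values are more spread out, since the sum of squared deviations of x sits in the denominator.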
K-Nearest Neighbor Algorithm:


*K-nearest neighbors (KNN) is a type of
supervised learning algorithm used for both
regression and classification. KNN tries to predict
the correct class for the test data by calculating
the distance between the test data and all the
training points.


*Suppose there are two categories, Category A
and Category B, and we have a new data point x1.
In which of these categories does this data point
lie? To solve this type of problem, we can use
the K-NN algorithm.

*The difference between KNN classification and
regression is small: a KNN classifier predicts a
class by using the most common class among the
k nearest neighbors, whereas KNN regression
predicts a numerical value from those neighbors.
KNN classifier vs regressor


KNN regression: Predicts a value by using
the mean of the k nearest neighbors.

Regression model: the codomain of the model is a
continuous space.

Classification model: the codomain of the model is a
discrete space.
Calculate the Euclidean distance between the new data point and each training data point.


Arrange the calculated Euclidean distances in ascending order.
Initialize k and take the first k distances from the sorted list.
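The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function names, the training points, and the choice of k are all hypothetical. Classification takes the most common label among the k nearest neighbors, and regression takes the mean of their values.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two points of equal dimension.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(train_X, train_y, query, k):
    # Compute all distances, sort them in ascending order, keep the first k targets.
    distances = sorted(
        ((euclidean(x, query), target) for x, target in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    return [target for _, target in distances[:k]]

def knn_classify(train_X, train_y, query, k):
    # Classification: most common class among the k nearest neighbors.
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k):
    # Regression: mean of the k nearest neighbors' values.
    return sum(k_nearest(train_X, train_y, query, k)) / k

# Hypothetical training data: two well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10, 12, 11, 50, 52, 51]

print(knn_classify(points, labels, (1.5, 1.5), k=3))  # A
print(knn_regress(points, values, (1.5, 1.5), k=3))   # 11.0
```

Sorting every distance is the simplest approach and matches the steps as described; for large training sets, a partial selection of the k smallest distances would avoid the full sort.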


